Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added parameters and exception behaviour to pqdm #792

Merged
merged 11 commits into from
Nov 4, 2024

Conversation

Sherwin-14
Copy link
Contributor

@Sherwin-14 Sherwin-14 commented Aug 24, 2024

I did all the steps outlined in the issue #581, we can now build upon this.


📚 Documentation preview 📚: https://earthaccess--792.org.readthedocs.build/en/792/

Copy link
Collaborator

@chuckwondo chuckwondo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your work on this PR @Sherwin-14!

There are still several places where you have to expose the new fail_fast parameter. So far, you have added it only to "protected" methods, so it is not yet exposed to the user in the "public" function and methods. You need to keep following the call chains up to the top (public) API so that users can use the parameter, if they want to choose False instead of the default True.

For example, you added fail_fast to Store._open_files, but you have not added it to all places that call Store._open_files. You did add it to Store._open_urls, which calls _open_files, but you did not yet add it to Store._open_granules, which also calls _open_files.

Further, both _open_files and _open_granules are dispatched when Store._open is called, so you must find all places where Store._open is called. In this case, it is called from Store.open, which is a public method, so you must add fail_fast to Store.open so the user can override the default, if they wish.

Further still, Store.open is called by the open top-level function in the api module, so you must add fail_fast there as well.

You must follow all such call chains from all points where pqdm is called, all the way up to the top of the call chains (up to the public functions and methods), adding fail_fast to each function and method along all such call chains.

@chuckwondo
Copy link
Collaborator

@mfisher87, I see lots of Ruff error messages regarding a number of files that @Sherwin-14 has not touched in this PR. Is there something already on main to address these errors, such that Sherwin simply needs to merge main into his own branch to silence them?

@mfisher87
Copy link
Collaborator

Not yet, but that should be an easy fix. I'll take care of it now :) #788

@mfisher87
Copy link
Collaborator

#793

@chuckwondo
Copy link
Collaborator

Not yet, but that should be an easy fix. I'll take care of it now :) #788
#793

Excellent! Thank you for the quick fix.

@Sherwin-14, Matt's fix is now merged into main, so please update your branch with the fix from main, which will eliminate the ruff errors in your build.

@chuckwondo
Copy link
Collaborator

@Sherwin-14, do you have enough information to proceed with the changes I requested?

@Sherwin-14
Copy link
Contributor Author

@Sherwin-14, do you have enough information to proceed with the changes I requested?

Yeah I have the info I needed, I had put this on hold since I was preoccupied with some other things for the last couple of days. I'll resume my work on this asap.

@betolink
Copy link
Member

betolink commented Sep 3, 2024

Would it be possible and in scope of this PR to use **kwargs and pass them to pqdm for both open() and download() with sane defaults? Users could be in control of the failing behavior but also the number of threads to use and the verbosity (don't display the progress bar). I'm not sure about the style, in Xarray they use **kwargs and an equivalent backend_kwargs={} that in our case could be something like pqdm_opts={}

# code in api.py and something similar in store.py
# in this example we leave threads out to ensure backwards compatibility and will use it in the store class if present
def open(granules, threads, provider, pqdm_opts):
    default_pqdm_opts = {
      n_jobs=threads,
      exception_behaviour='immediate'
      colour="green",
      disable=False # True means no output including the progress bar.
    }

    if not pqdm_opts;
        pqdm_opts = default_pqdm_opts
    earthaccess.__store__.open(granules, treads, pqdm_opts)

and actually we should do the same with fsspec so we let users tune their caching behavior when opening the files, aka a "smart_open". What do you all think?

@mfisher87
Copy link
Collaborator

I had put this on hold since I was preoccupied with some other things for the last couple of days. I'll resume my work on this asap.

@Sherwin-14 No rush! We really appreciate any time you can spare, and don't want to ask for more than that :)

@mfisher87
Copy link
Collaborator

@betolink I really like that idea. Basically "Allow customizing all pqdm behaviors" (and separately, "Allow customizing all fsspec behaviors")?

I'm not sure how I feel about including that as part of this PR. I lean towards separate out of respect for @Sherwin-14 's time, but I don't want to speak for you, Sherwin!

@mfisher87
Copy link
Collaborator

mfisher87 commented Sep 3, 2024

We decided during hack day today that we want to allow users to customize pqdm behaviors like so:

def open(something, threads=16, *, pqdm_kwargs={...}, ...):
    pqdm_kwargs_defaults = {"exception_behavior": "immediate", "threads": threads}
    pqdm_kwargs = pqdm_kwargs_defaults.update(pqdm_kwargs)
    ...

We considered the threads argument as fundamental to the function of earthaccess, so we think it should remain a top-level parameter. We'll use threads whether we use pqdm or not in 5 years.

@Sherwin-14
Copy link
Contributor Author

Sherwin-14 commented Sep 5, 2024

We decided during hack day today that we want to allow users to customize pqdm behaviors like so:

def open(something, threads=16, *, pqdm_kwargs={...}, ...):
    pqdm_kwargs_defaults = {"exception_behavior": "immediate", "threads": threads}
    pqdm_kwargs = pqdm_kwargs_defaults.update(pqdm_kwargs)
    ...

We considered the threads argument as fundamental to the function of earthaccess, so we think it should remain a top-level parameter. We'll use threads whether we use pqdm or not in 5 years.

How should we now proceed with the changes we discussed during the hackday?

@chuckwondo
Copy link
Collaborator

We decided during hack day today that we want to allow users to customize pqdm behaviors like so:

def open(something, threads=16, *, pqdm_kwargs={...}, ...):
    pqdm_kwargs_defaults = {"exception_behavior": "immediate", "threads": threads}
    pqdm_kwargs = pqdm_kwargs_defaults.update(pqdm_kwargs)
    ...

We considered the threads argument as fundamental to the function of earthaccess, so we think it should remain a top-level parameter. We'll use threads whether we use pqdm or not in 5 years.

How should we now proceed with the changes we discussed during the hackday?

Do the same thing you did in the PR you already started for this, but instead of fail_fast (which is what you pretty much completed) change it to pqdm_kwargs and adjust your code changes similar to what is shown above, although the code above won't work because pqdm has an n_jobs arg, not a threads arg, so I suggest something like so, instead, within the functions/methods that invoke pqdm:

pqdm_kwargs = {
    "exception_behavior": "immediate",
    "n_jobs": threads,
    **pqdm_kwargs,
}

...

pqdm(<other args>, **pqdm_kwargs)

@mfisher87
Copy link
Collaborator

Great point! Much clearer with the double splat too ;)

@chuckwondo
Copy link
Collaborator

@Sherwin-14, do you need some help with this?

@Sherwin-14
Copy link
Contributor Author

@Sherwin-14, do you need some help with this?

I will try to accomodate the change asap, I was caught up in something else up untill now

@chuckwondo
Copy link
Collaborator

@Sherwin-14, do you need some help with this?

I will try to accomodate the change asap, I was caught up in something else up untill now

No rush. I just wanted to check in to see if you need any help or if something else was requiring you to postpone your continuation.

@Sherwin-14
Copy link
Contributor Author

@chuckwondo

def download(
    granules: Union[DataGranule, List[DataGranule], str, List[str]],
    local_path: Optional[str],
    provider: Optional[str] = None,
    threads: int = 8,
    pqdm_kwargs: dict = {}
) -> List[str]:
    """Retrieves data granules from a remote storage system.

       * If we run this in the cloud, we will be using S3 to move data to `local_path`.
       * If we run it outside AWS (us-west-2 region) and the dataset is cloud hosted,
            we'll use HTTP links.

    Parameters:
        granules: a granule, list of granules, a granule link (HTTP), or a list of granule links (HTTP)
        local_path: local directory to store the remote data granules
        provider: if we download a list of URLs, we need to specify the provider.
        threads: parallel number of threads to use to download the files, adjust as necessary, default = 8

    Returns:
        List of downloaded files

    Raises:
        Exception: A file download failed.
    """
    provider = _normalize_location(provider)
    pqdm_kwargs = {
        "exception_behavior": "immediate",
        "n_jobs": threads,
        **pqdm_kwargs,
    }
    if isinstance(granules, DataGranule):
        granules = [granules]
    elif isinstance(granules, str):
        granules = [granules]
    try:
        results = earthaccess.__store__.get(
            granules, local_path, provider, threads, pqdm_kwargs
        )    
    except AttributeError as err:
        logger.error(
            f"{err}: You must call earthaccess.login() before you can download data"
        )
        return []
    return results

Just for clarification, the public API function download would look something like this. The pqdm_kwargs would be taken in at the user interface and then passed on to the lower level functions in the entire call chain isn't it? or am I missing something?

@chuckwondo
Copy link
Collaborator

Correct, but you almost never want to use mutable values for default values, and this is one of those cases (among other references, see https://docs.python-guide.org/writing/gotchas/#mutable-default-arguments).

So, do NOT do this:

def download(
    granules: Union[DataGranule, List[DataGranule], str, List[str]],
    local_path: Optional[str],
    provider: Optional[str] = None,
    threads: int = 8,
    pqdm_kwargs: dict = {}
) -> List[str]:
    pqdm_kwargs = {
        "exception_behavior": "immediate",
        "n_jobs": threads,
        **pqdm_kwargs,
    }

Instead, DO THIS:

def download(
    granules: Union[DataGranule, List[DataGranule], str, List[str]],
    local_path: Optional[str],
    provider: Optional[str] = None,
    threads: int = 8,
    pqdm_kwargs: Optional[Mapping[str, Any]] = None,
) -> List[str]:
    pqdm_kwargs = {
        "exception_behavior": "immediate",
        "n_jobs": threads,
        **(pqdm_kwargs or {}),
    }

@codecov-commenter
Copy link

codecov-commenter commented Sep 29, 2024

⚠️ Please install the 'codecov app svg image' to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

Attention: Patch coverage is 50.00000% with 27 lines in your changes missing coverage. Please review.

Project coverage is 68.78%. Comparing base (2f49c08) to head (1061743).
Report is 69 commits behind head on main.

Files with missing lines Patch % Lines
earthaccess/store.py 0.00% 22 Missing ⚠️
earthaccess/api.py 37.50% 5 Missing ⚠️

❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

❗ There is a different number of reports uploaded between BASE (2f49c08) and HEAD (1061743). Click for more details.

HEAD has 6 uploads less than BASE
Flag BASE (2f49c08) HEAD (1061743)
14 8
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #792      +/-   ##
==========================================
- Coverage   73.88%   68.78%   -5.11%     
==========================================
  Files          31       32       +1     
  Lines        2003     2153     +150     
==========================================
+ Hits         1480     1481       +1     
- Misses        523      672     +149     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Collaborator

@chuckwondo chuckwondo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @Sherwin-14. In addition to my other comments, please add at least 1 unit test. I don't think you need to test passing all possible pqdm kwargs, but perhaps just test that you can produce an error for downloading and have it raise an error (now that the default exception behavior is "immediate") instead of returning with a list containing errors.

earthaccess/api.py Outdated Show resolved Hide resolved
earthaccess/api.py Outdated Show resolved Hide resolved
earthaccess/store.py Outdated Show resolved Hide resolved
earthaccess/store.py Outdated Show resolved Hide resolved
earthaccess/store.py Outdated Show resolved Hide resolved
CHANGELOG.md Outdated Show resolved Hide resolved
Comment on lines 400 to 409
access = "direct"
provider = granules[0]["meta"]["provider-id"]
# if the data has its own S3 credentials endpoint, we will use it
endpoint = self._own_s3_credentials(granules[0]["umm"]["RelatedUrls"])
if endpoint is not None:
logger.info(f"using endpoint: {endpoint}")
s3_fs = self.get_s3_filesystem(endpoint=endpoint)
else:
logger.info(f"using provider: {provider}")
s3_fs = self.get_s3_filesystem(provider=provider)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where did this come from?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Sherwin-14, it seems you've marked many suggestions as resolved, but I don't see the changes. Perhaps I'm not looking in the right spot in the GitHub UI, but are you sure you pushed your changes?

earthaccess/api.py Show resolved Hide resolved
earthaccess/api.py Show resolved Hide resolved
earthaccess/store.py Show resolved Hide resolved
Copy link
Collaborator

@chuckwondo chuckwondo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Sherwin-14, I suspect you marked many suggestions as "resolved," but forgot to push the changes that resolve the suggestions.

Also, please remember to go back to the Conversations tab of your PR and, in the upper-right area of the tab, click the circular double-arrows next to a reviewer's name to request a re-review after the reviewer has requested changes. Otherwise, the reviewer doesn't know when you're finished making changes and ready for a follow-up review.

Clicking that icon will send another email indicating that you want another review.

image

earthaccess/store.py Show resolved Hide resolved
earthaccess/store.py Show resolved Hide resolved
CHANGELOG.md Outdated Show resolved Hide resolved
earthaccess/api.py Show resolved Hide resolved
earthaccess/api.py Show resolved Hide resolved
Comment on lines 400 to 409
access = "direct"
provider = granules[0]["meta"]["provider-id"]
# if the data has its own S3 credentials endpoint, we will use it
endpoint = self._own_s3_credentials(granules[0]["umm"]["RelatedUrls"])
if endpoint is not None:
logger.info(f"using endpoint: {endpoint}")
s3_fs = self.get_s3_filesystem(endpoint=endpoint)
else:
logger.info(f"using provider: {provider}")
s3_fs = self.get_s3_filesystem(provider=provider)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Sherwin-14, it seems you've marked many suggestions as resolved, but I don't see the changes. Perhaps I'm not looking in the right spot in the GitHub UI, but are you sure you pushed your changes?

Comment on lines 696 to 698
fail_fast: if set to True, the download process will stop immediately
upon encountering the first error. If set to False, errors will be
deferred, allowing the download of remaining files to continue.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see a change. Unresolving.

@chuckwondo
Copy link
Collaborator

@Sherwin-14, several other PRs have landed on main, so please also be sure to merge the changes from main and resolve conflicts.

@Sherwin-14
Copy link
Contributor Author

Sherwin-14 commented Oct 6, 2024

@chuckwondo Hey Chuck, I have made the appropriate changes but I haven't pushed the code yet since I was writing unittest for this one which you suggested earlier. I will re-request for a review once I feel this is ready to be reviewed

@chuckwondo
Copy link
Collaborator

@chuckwondo Hey Chuck, I have made the appropriate changes but I haven't pushed the code yet since I was writing unittest for this one which you suggested earlier. I will re-request for a review once I feel this is ready to be reviewed

Sounds good, but please wait to mark any suggestions as resolved until you push the corresponding code changes, otherwise it is confusing to others.

@Sherwin-14
Copy link
Contributor Author

Sherwin-14 commented Oct 27, 2024

class TestEarthAccessDownload(unittest.TestCase):
    def test_download(self):

        earthaccess.login()

        results = earthaccess.search_data(
            short_name='ATL06',  # ATLAS/ICESat-2 L3A Land Ice Height
            bounding_box=(-10, 20, 10, 50),  # Only include files in area of interest...
            temporal=("1999-02", "2019-03"),  # ...and time period of interest.
            count=10
        )


        with patch.object(earthaccess, '__store__', return_value= Mock()) as mock_store:
            with patch.object(mock_store, 'get') as mock_get:
    
                mock_get.side_effect = Exception('Download failed')
    
     
                files = earthaccess.download(results, "/home/download-folder")
    
                self.assertEqual(files, [])

if __name__ == '__main__':
    unittest.main()

This is the test function I came up with and this is giving me an error log that looks something like this. Is this the thing we want to achieve at this point? Also if the test is correct then where should I place it should I make a completely new file for this like test_api.py?

@chuckwondo
Copy link
Collaborator

chuckwondo commented Oct 27, 2024

Hi @Sherwin-14. Thanks for working on the unit tests for this PR.

The first thing I suggest you do is merge the latest from main into your branch, so you're working with the latest code.

Regarding the test you've written, it looks like you're generally heading in the right direction.

I have some suggestions:

  • Sure, if you want to put your new tests in a new tests/unit/test_api.py file, that sounds good to me.
  • Do not use the unittest library. There's no need for the additional "overhead". Simply use top-level test functions and built-in assert statements. We have almost completely moved away from unittest in our integration tests, and we have a mix in our unit tests, but we should continue moving away from unittest (although I suspect this may be up for debate).
  • Regarding your test shown above, that looks good, but since the new behavior is for the call to download to fail immediately upon any single download failure (i.e., pqdm exception_behavior="immediate"), then you will want to put your call to download in a context manager, as by default we expect an exception to be raised (as shown in your linked test failure output), thus we do not expect to be able to assert that we get back an empty list of results:
        with patch.object(earthaccess, '__store__', return_value= Mock()) as mock_store:
            with patch.object(mock_store, 'get') as mock_get:
                mock_get.side_effect = Exception('Download failed')

                with pytest.raises(Exception, match="Download failed"):
                    earthaccess.download(results, "/home/download-folder")

@Sherwin-14
Copy link
Contributor Author

I have taken out unittest but it seems that it is necessary to use unittest for importing the likes of Mock. Another option here would be to use the pytest mock library but that means we will need to add an additional dependency.

@chuckwondo
Copy link
Collaborator

I have taken out unittest but it seems that it is necessary to use unittest for importing the likes of Mock. Another option here would be to use the pytest mock library but that means we will need to add an additional dependency.

You can still use unittest Mocks, but there's no need for creating a subclass of unittest.TestCase.

Copy link

github-actions bot commented Oct 28, 2024

Binder 👈 Launch a binder notebook on this branch for commit d22f2ac

I will automatically update this comment whenever this PR is modified

Binder 👈 Launch a binder notebook on this branch for commit 374f303

Binder 👈 Launch a binder notebook on this branch for commit de8df1c

Binder 👈 Launch a binder notebook on this branch for commit 767e119

Binder 👈 Launch a binder notebook on this branch for commit 2d485b3

Binder 👈 Launch a binder notebook on this branch for commit 1061743

Copy link
Collaborator

@chuckwondo chuckwondo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In addition to my comments, please also make corrections to fix the pre-commit issues: https://results.pre-commit.ci/run/github/399867529/1730125212.zyU74SY_QNaQKq_PZmUrIQ

It looks like you perhaps have not installed pre-commit hooks locally, otherwise your attempt to commit your changes should have produced those same pre-commit issues locally, thus preventing your commit from completing.

Please install pre-commit, if you haven't already done so, an then run the following from your forked/cloned repo to install the hooks configured in the repo: pre-commit install --install-hooks. Then run pre-commit run -a to run the pre-commit checks against all the files, which should then help you fix the pre-commit errors.

earthaccess/api.py Outdated Show resolved Hide resolved
CHANGELOG.md Outdated Show resolved Hide resolved
from earthaccess import api,download
from unittest.mock import Mock

def test_download(monkeypatch):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def test_download(monkeypatch):
def test_download_immediate_failure(monkeypatch):

tests/unit/test_api.py Show resolved Hide resolved
Sherwin-14 and others added 2 commits October 28, 2024 22:53
Co-authored-by: Chuck Daniels <[email protected]>
Co-authored-by: Chuck Daniels <[email protected]>
Copy link
Collaborator

@chuckwondo chuckwondo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like you have not correctly merged/rebased from main. I see changes that I should not see. For example, in CHANGELOG.md, I see a new Fixed section, which is not new. It is already on main, so should not show up as something new on your branch.

I think there are some other spots that seem out of sync too, such as in the earthacces/store.py method _open_files, where the type of the threads parameter seems to have been clobbered. It should be threads: int = 8 (a recent change), instead of threads: Optional[int] = 8.

chuckwondo
chuckwondo previously approved these changes Oct 29, 2024
Copy link
Collaborator

@chuckwondo chuckwondo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @Sherwin-14.

I'm approving, but do @mfisher87, @betolink, or @jhkennedy want to take a pass over this before it gets merged?

Copy link
Collaborator

@mfisher87 mfisher87 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks amazing, @Sherwin-14 ! Thank you! A little nitpick, and let's merge this!

tests/unit/test_api.py Outdated Show resolved Hide resolved
@chuckwondo chuckwondo merged commit 84be54e into nsidc:main Nov 4, 2024
13 checks passed
Copy link

github-actions bot commented Nov 5, 2024

User pre-commit-ci[bot] does not have permission to run integration tests. A maintainer must perform a security review of the code changes in this pull request and re-run the failed integration tests jobs, if the code is deemed safe.

@chuckwondo chuckwondo mentioned this pull request Nov 5, 2024
9 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants